{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## Get started\n",
    "\n",
    "We first initialize the [sbert operator](https://towhee.io/sentence-embedding/sbert), which takes a sentence or a list of sentences (strings) as input. It generates one embedding vector (a `numpy.ndarray`) per sentence, which captures the sentence's core semantic elements.\n",
    "Then, we fine-tune the operator on the Semantic Textual Similarity (STS) task, which assigns a score to the similarity of two texts. We use the [STSbenchmark](https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark) dataset as training data.\n",
    "To train on the specified task, we only need to construct an operator instance and pass in a training configuration."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import towhee\n",
    "import os\n",
    "from sentence_transformers import util\n",
    "\n",
    "# Build the sbert operator from a pretrained checkpoint.\n",
    "op = towhee.ops.sentence_embedding.sbert(model_name='nli-distilroberta-base-v2').get_op()\n",
    "\n",
    "# Download the STSbenchmark data once; later runs reuse the local copy.\n",
    "sts_dataset_path = 'datasets/stsbenchmark.tsv.gz'\n",
    "if not os.path.exists(sts_dataset_path):\n",
    "    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_dataset_path)\n",
    "\n",
    "# The trained model is saved under model_save_path.\n",
    "model_save_path = './output'\n",
    "training_config = {\n",
    "    'sts_dataset_path': sts_dataset_path,\n",
    "    'train_batch_size': 16,\n",
    "    'num_epochs': 4,\n",
    "    'model_save_path': model_save_path\n",
    "}\n",
    "op.train(training_config)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Load trained weights"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Initialize a new operator from the trained folder saved under `model_save_path`.\n",
    "# os.listdir() returns entries in arbitrary order, so sort before taking the last one.\n",
    "model_path = os.path.join(model_save_path, sorted(os.listdir(model_save_path))[-1])\n",
    "new_op = towhee.ops.sentence_embedding.sbert(model_name=model_path).get_op()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
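  {
   "cell_type": "markdown",
   "source": [
    "One common way to compare two sentence embeddings is cosine similarity. The cell below is a minimal sketch of that score and is not part of the sbert operator: `cosine_similarity` is a hypothetical helper, and the toy vectors are made-up values standing in for real embeddings. The commented-out lines at the end assume `new_op` returns one 1-D vector per sentence."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def cosine_similarity(u, v):\n",
    "    # Cosine of the angle between two equal-length 1-D vectors.\n",
    "    if len(u) != len(v):\n",
    "        raise ValueError('vectors must have the same length')\n",
    "    dot = sum(float(a) * float(b) for a, b in zip(u, v))\n",
    "    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))\n",
    "    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))\n",
    "    if norm_u == 0.0 or norm_v == 0.0:\n",
    "        raise ValueError('cosine similarity is undefined for a zero vector')\n",
    "    return dot / (norm_u * norm_v)\n",
    "\n",
    "\n",
    "# Toy vectors standing in for sentence embeddings (made-up values).\n",
    "emb_a = [0.2, 0.7, 0.1]\n",
    "emb_b = [0.4, 1.4, 0.2]   # same direction as emb_a, so the score is close to 1\n",
    "emb_c = [0.7, -0.2, 0.0]  # orthogonal to emb_a, so the score is close to 0\n",
    "\n",
    "print(cosine_similarity(emb_a, emb_b))\n",
    "print(cosine_similarity(emb_a, emb_c))\n",
    "\n",
    "# With the reloaded operator (assumes one 1-D vector per input sentence):\n",
    "# score = cosine_similarity(new_op('A man is playing a guitar.'),\n",
    "#                           new_op('Someone plays a guitar.'))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },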
  {
   "cell_type": "markdown",
   "source": [
    "## Dive deep and customize your training\n",
    "You can adapt the [training script](https://towhee.io/sentence-embedding/sbert/src/branch/main/train_sts_task.py) to your own needs, or refer to the original [sbert training guide](https://www.sbert.net/docs/training/overview.html) and [code examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) for more information."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}